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Abstract 

We introduce a Markov model for the evolution of a gene family 
along a phylogeny. The model includes parameters for the rates of 
horizontal gene transfer, gene duplication, and gene loss, in addition 
to branch lengths in the phylogeny. The likelihood for the changes in 
the size of a gene family across different organisms can be calculated 
in 0(N + hM 2 ) time and 0(N + M 2 ) space, where N is the number 
of organisms, h is the height of the phylogeny, and M is the sum of 
family sizes. We apply the model to the evolution of gene content 
in Preoteobacteria using the gene families in the COG (Clusters of 
Orthologous Groups) database. 
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1 Introduction 



At this time, 257 microbial genomes are sequenced, about twice as many 
are soon to be completed, and 20 complete eukaryotic genomes are pub- 



licly available (http://www.ncbi.nlm.nih.gov/Genomes/). These numbers 



continue to grow in an exponential pace with advances of technology and 
savvy ((Green 2001)) . The wealth of genome sequence data already caused a 
revolution in molecular evolution methods (Wo lfe and Li 200"3| IDelsuc et al. 2 005). 
A few years ago, scientific studies have necessarily focused on nucleotide-level 
differences between orthologous genes, mainly because of the technical and fi- 
nancial limitations on DNA sequence collection. With the increasing amount 
of whole genome information it becomes possible to analyze genome-scale dif- 
ferences between organisms, and to identify the evolutionary forces respon- 
sible for these changes. In particular, sizes of gene families can be compared 
with the aim of better understanding adaptive evolutionary mechanisms and 
organismal phylogeny. Several studies point to the fact that gene content 
may carry sufficient phylogenetic signal for the construction of evolution- 
ary trees (|Fitz- Gibbon and" Hou se 1999: ISnel et al. 19991 IT^kaia et al. 19991 
ILin and Gerstein 20001 1( Clarke et al. 20021 iKorBel et al. 20021 IDutilh et al. 20041 
Gu and Zhang 20041 IHuson and Steel 20041 ILake and Rivera 2004j) . Com- 



parative analyses of genome-wide protein domain content (|Lin and Gerstein 2000: 
[Yang et al. 20051 IDeeds et al. 2005]) have also provided important insights 
into evolution. Gene content and similar features have been used to con- 
struct viral ( |Montague and Hutchison HI 2 000 



bial (S nel et al. 1999| IFitz-Gibbon and House 



IHerniou et al. 2001jl . micro- 



999(|Gu and Zhang 2004| ), and 



universal trees ()Tekaia et a l. 1999| ISTmonson et al. 2005) |Yang et al. 20 05). 



Comparative gene content analysis is also used to estimate ancestral genome 
composition QSnel et al. 20021 IMirkin et al. 2003|) . The presence-absence 
pattern of homologs in different organisms, the so-called phyletic pattern 
( Koonin and Galperin 2002] ITatusov et al. 2003j) , provides clues about gene 



function ( |Pellegrini et a l. 1999 ) and evolution of metabolic pathways (|Mirkin et al. 2003j) 



There are two principal methodological issues in the analysis of gene con- 
tent. First, one needs to decide how homologous gene are selected, and 
secondly, an appropriate computational technique must be chosen to ana- 
lyze the data. A possible choice for compiling a data set is to use pairwise 
orthologs and compute a score for each pair of genomes ()Snel et a l. 1999 
ITekaia et al. 19991 iKornel et al. 20021 IClarke et al. 2002UDutilh et al. 20041) . 
The matrix of pairwise scores of shared gene content is then amenable to 
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analysis by distance-based methods of phylogeny construction. An alterna- 
tive is proposed in ()Lake and Rivera 2004|) : for each pair of genomes, the 
presence or absence of homologs with respect to a third reference genome 
is noted, and the pairs of presence-absence sequences are used to calculate 
a pairwise distance matrix. A third approach is to first compute families 
of homologous genes, and then to record the absence or presence of each 
family in the genomes (jFitz-Gibbon and HouselMl lL m ana Gerstein 2000; 
Uordan et al. 20011 [Wolf et al. 2001]) . The resulting absence -presence data 
can be treated as a set of 0-1 sequences and further analyzed with traditional 
parsimony or distance-based methods. Some specialized parsimony meth- 
ods have been developed for the direct purpose of analyzing gene absence- 
presence data (IMirkin et al. 2003| IKunin and Ouzounis 2003|) . Instead of 
simply noting presence-absence, the gene family sizes give an even richer sig- 
nal for evolutionary analyses (|Snel et al. 20021 IHuson and Steel 2004( IHahn et al. 20 05 ) . 
It is that latter type of data that we model here. 

A number of processes shape the gene content of an organism. New genes 
may be created by duplication of an existing gene, horizontal transfer from 
a different lineage, and fusion/fission fS nel et al. 2002)) . It has been widely 
debated how the extent of horizontal gene transfer (HGT) compares to ver- 



tical inheritance (| Jordan et al. 2001tlSnel et al. 2002 


Gogarten et al. 2002 


IKurland et al. 2003NKunin et al. 20051 IGe et al. 2005 


ISimonson et al. 2005) 



It is clear that horizontal gene transfer plays a major role in microbial evolu- 
tion (|Boucher et al. 2003|) . but there is still need for adequate mathematical 
models in which that role can be measured. 

This paper describes a probabilistic model for gene content evolution. 
Specifically, we model the evolution of gene families along a phylogeny. The 
model includes gene duplication, gene loss, and horizontal transfer as mech- 
anisms that determine gene family evolution. We also show how to com- 
pute exact likelihoods for gene family sizes in different organisms. A few 
probabilistic models were proposed for gene content evolution, but they are 
less general than ours. Usual stochastic models work with two parameters. 
Gu and Zhang ( Gu and Zhang 2004| IGu et al. 2005|) rely on a model that 



includes gene loss and gene duplication but no other modes of gene gen- 
esis. They showed how gene family sizes can be used to define additive 
distances in such a model. Interestingly enough, the data can be reduced to 
a three-letter alphabet for the purposes of distance calculations: only 0, 1 
or "many" homologs per family need to be counted. The distance relies 
on an estimate of the rate parameters, which is obtained through likeli- 



2 



hood optimization. Hahn et al. f H ahn et al. 200 5) developed an alterna- 
tive likelihood-based approach for the same two-parameter model. Huson 
and Steel (|Huson and Steel 2004j) analyzed a two-parameter model that ac- 
counts for gene loss and horizontal transfer but not for gene duplication. 
They derived a distance measure based on gene family sizes using likelihood 
maximization arguments. They further showed that other distance mea- 
sures based on shared gene content (|Snel et a l. 1999) have inferior accuracy 
in phylogeny reconstruction than either Dollo parsimony or their own dis- 
tance measure. Karev et al. developed a rich probabilistic model of gene 
content evolution in a series of papers (|Karev et al. 20021 IKarev et al.~2 003 ; 
IKarev et al. 2004)) . The model explains the distribution of gene family sizes 
found in different organisms. It is, however, too general for exact detailed 
calculations, and for likelihood computations in particular. 

To our knowledge, no tractable stochastic model was introduced yet that 
accounts for horizontal transfer, gene loss, and duplication. These processes 
cannot be modeled by using only two parameters because the intensity of 
gene loss and duplication depend on the size of a gene family, but the tempo 
by which genes are acquired through horizontal transfer has a constant com- 
ponent. Among other applications, a model that accounts for duplication 
and transfer is useful in analyzing the evolution of metabolic networks: do 
new paths evolve by gene duplication and adaptive selection, or by accom- 
modating genes with new functions via horizontal gene transfer? This paper 
introduces a probabilistic model for the evolution of homologs on a phylogeny. 
Specifically, we model the evolution of a single gene family along the phy- 
logeny, where different processes may add new genes to the family or erase 
members of it, and arrive at the family sizes observed at the terminal taxa. 
We provide an algorithm that can compute analytically the likelihood of gene 
family sizes in different organisms, given an evolutionary tree. The algo- 
rithm calculates the likelihood of family sizes in 0(N + M 2 h) time where M 
is the total number of genes in the family, N is the number of genomes, 
and h is the height of the tree. The tree height is at most linear in N, and 
on average, it is 0(y/~N) or O(logiV) for uniform or Yule-Harding distribu- 
tion of random trees. In contrast, the methods of (Gu and Zhang 2004) 
and (|Huson and Steel 2004)) compute distances between every pair of or- 
ganisms, which takes quadratic time in N. The likelihood calculations of 
(|Hahn et al. 2005j) take cubic time in M, and involve the evaluation of infi- 
nite sums that are truncated heuristically. 

The article is organized in the following manner. Section |21 introduces 
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our stochastic model of gene content evolution, and describes formulas for 
computing various associated probabilities, including likelihood. The formu- 
las are used in an algorithm described in Section El Section 0] describes our 
initial experiments in modeling gene content evolution in 51 Proteobacte- 
ria and 3555 gene families from the COG (Clusters of Orthologous Groups) 
database (|Tatusov et al. 2003|) . Section concludes the paper. 



2 Mathematical model 

Let T be a phylogenetic tree over a set of species S. The tree T is a rooted 
tree with vertex set V(T) and edge set E(T), in which leaves are bijectively 
labeled with elements of S. Every edge e has a length t e > 0. We are 
interested in modeling the evolution of a gene family. The family size changes 
along the edges: genes may be duplicated, lost, or gained from an unknown 
source. We model the evolution of gene counts (family size) at the tree 
nodes: the gene count at every node u G V(T) is a random variable xi u ) 
that can take non-negative integer values. In addition to the tree with its 
edge lengths, three parameters determine the joint distribution of the gene 
counts: a duplication rate A, a loss rate fi, and a transfer rate k. The loss 
rate accounts for all possible mechanisms of gene loss, including deletion and 
pseudogenization. The transfer rate accounts for processes of gene genesis, 
including HGT from another lineage in the same tree, or HGT from an 
unknown organism. 

In our model, the evolution of the gene counts on a branch follows a 
linear birth-and-death process (jFeller 1950|) parametrized by A, k, and \i. 
Let {X(t) : t > 0} denote the continuous-time Markov process formed by 
the gene counts along an edge uv: x( u ) = -^(0) an d x( v ) — X(t uv ). The 
transition probabilities of the process are the following: 

P j X(t + St) = n + 1 X(t) = n\ = (k + n\\ St + o(5t) 
pjx(t + St) = n- 1 X(t) =nj = n/iSt + o(St) 
¥^\X(t + St) -n\ > 1 X(t) = n^ = o(5i). 

In other words, every existing gene produces an offspring through duplication 
with an intensity of A, or disappears with an intensity of fi, and new genes are 
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Figure 1: Galton- Watson forest that shows the evolution of genes in the same 
family along a tree edge. The top line represents the ancestral genome with 
three genes; the bottom line represents the descendant genome, in which 
there are five family members. Symbol o represents the source from which 
genes might be transferred horizontally, symbols * represent copies of the 
gene in the genome at the beginning and the end of the investigated time 
span t. Each o or * in the ancestral genome is the root of a Galton- Watson 
tree. Note that the physical order of genes in the genomes is immaterial: 
here they are simply drawn next to each other for clarity. 



acquired with an intensity of k, independently from the number of existing 
genes. 

The histories of individual genes on an edge form a Galton- Watson forest, 
see Figure ^ The figure illustrates a scenario where the gene family increases 
from three to five genes. The counts at the branch endpoints are the result 
of many duplication, transfer and loss events. The change involves three 
horizontally transferred genes, from among which one survives, another one 
does not, and the third one produces two surviving paralogs. 

While it is not too difficult to calculate the probabilities for any particular 
gene count on a branch (see £ J2.1J) . the likelihood L of observed gene counts at 
the leaves involves an infinite number of possible gene counts at intermediate 
nodes: 

L= Yl TOroot) II F \x(y)= r m y X(x)=m x \, (1) 

(m x :xeV(T)) xy£E(T) 1 J 

where 7(-) is the distribution at the root, and the summation over the {m x ) 
vectors takes all values in agreement with the gene counts at the leaves in the 
input data. Our main technique for computing the likelihood is to restrict 
the computation to genes that have at least one surviving descendant at the 
leaves. In what follows we develop the formulas to compute the likelihood. 
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2.1 Basic transition probabilities 

First we analyze the blocks of homologs at a node that have a common 
origin. One block is formed by the genes that trace back to a horizontal 
transfer event on the branch from the parent. Each other block is the set of 
paralogs with the same ancestor at the parent. The homologs in Figure Q 
form four blocks: a block of size three that comprises the descendants of the 
horizontally transferred genes, a block of size zero for the deceased parental 
gene, and two blocks of size one. The independent birth-and-death processes 
associated with the blocks have been thoroughly analyzed in the statistical 
literature. 

Definition 1. Define the following basic transition probabilities for gene 
count evolution on a branch. Let h t (n) denote the probability that there are n 
genes of foreign origin [not inherited from the parent] after time t. Let gt{n) 
denote the probability that a single gene has n copies after time t. 

Theorem 1. The basic transition probabilities can be written as follows. 

h t (n)=^ + n - 1 )(l-XP(t))HxmT (2) 



n 



where (3(t) = * ^-frA)t , and 




n 



Furthermore, 

(n) = i^ {t) */n = 0; 

\(l-/i^))(l-A^))(A^)) n 1 z/n>0. 

Proof. The size of the HGT block of homologs follows a birth-and-death 
process with constant rate k of immigration and no emigration. The tran- 
sition probabilities of (J2J) for such a process were analyzed by Karlin and 
McGregor ( |Karlin and McGregor 1958 ). Blocks of paralogs evolve by a sim- 



ple birth-and-death process: the transition probabilities of (jSJ) are derived 
in, e.g., ()Feller 1950jl . □ 
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2.2 Gene extinction and survival 



Definition 2. A surviving gene at a node x is such that it has at least one 
modern descendant at the leaves below x. 

Let D x denote the probability that a gene present at node x is not sur- 
viving, i.e., that it has no modern descendants. 

Lemma 1. The extinction probability D x can be calculated as follows. If x 
is a leaf, then D x = 0. Otherwise, let x be the parent of x\, X2, ■ ■ ■ , %k- 

d x = n(W,) + (1 - - x ^)) l _xp(t 3 )D x ) (4) 

where tj is the length of the branch leading from x to Xj . 

Proof. For leaves, the statement is trivial. When x is not a leaf, condition 
on the gene count at xf 

k 00 

D x = m2g tj (m)(D Xj ) m . 
j=l m=0 

Plugging in gt{m) from Eq. (j3J) and replacing the infinite series with a closed 
form gives (jlj). □ 

2.3 Effective transition probabilities 

We introduce two new probabilities, denoted by H x {n) and G x (n). They ac- 
count for the number n of surviving genes at node x, either acquired through 
horizontal transfer, or through duplication and loss from a single gene. The 
effective transition probabilities are related to h t (n), and gt{n), but account 
for eventual extinction below node x. A formal definition follows. 

Definition 3. Let y be a non-root node and x its parent. Define the following 
effective transition probabilities. Let H y {n) denote the probability that at 
node y there are n surviving genes of foreign origin, i.e., that have no ancestor 
at x. Let G y (n) denote the probability that a single gene at x has n surviving 
copies at node y. 
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Lemma 2. Let y be a non-root node, let x be its ancestor, and let t be the 
length of the edge xy. The effective transition probabilities can be written as 
follows. 

^w^+r^M^wi^^y (5) 



l-D y \(3(t)J \ 1-D y \f3(t) J 



r <m , ^-nWW- D v) , fi , 

M j (w)) (i - d„w)) v 1 - D v m) ) ' ( } 

Proof. We condition on the number of genes at node y (whether or not they 
survive) . 



i=0 \ 



Using Eq. (J2J) leads to an infinite series that can be simplified to get ©• 
Similarly, write 



00 'n + i 



i=0 



Taking the values of ^(n + i) from Eq. (jSJ) and simplifying the resulting 
infinite series yields (JSJ). □ 



2.4 Number of surviving genes on a branch 

Definition 4. Let y be a non-root node, and letx be its ancestor. Let p y (m\n) 
denote the survival probability defined as the probability of the event that 
there are m surviving genes at node y under the condition that there are n 
genes at node x (not necessarily surviving). 
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Lemma 3. The survival probabilities can be computed as follows. 

p y (m\0) = H y (m) (7a) 

p y (0\n) = H y (0)(G y (0)) n 0<n (7b) 

p y (l\n) = G v (0)p y (l\n-l)+G v (l)p v (0\n-l) < n (7c) 

p y (m\n) = apy(m — l\n) < n, 1 < m (7d) 

+((?„(!) - aG y (0)) P y(m - l\n - 1) 



+Gy(0)p y (m\n - 1) 



where 



(i - p y )\m 

l-D y XP{t) ' K } 



Proof. For p y (m\0) and ^(Oln), the equations are straightforward. Other- 
wise, we condition on the surviving copies of a single gene at y: 

m 

p y (m\n) =^2Gy(i)p y (m- i\n- 1). (9) 

8=0 

Now, using that G y {i + 1) = aG y (i) whenever i > 0, and comparing Q 
for py(m\n) and p y (m — l\n), we can write p y (m\n) in a recursive form as 
shown. □ 



2.5 Conditional likelihoods 

Definition 5. Let x be a node in the tree. Define the conditional likelihood 

Lxin) for all n as the probability of having the observed gene counts at the 
leaves in the subtree rooted at x, under the condition that there are n surviving 
copies at x. 

Theorem 2. The conditional likelihoods can be calculated as follows. In the 
case when x is a leaf, L x (n) — lifnis the observed gene count at x, otherwise 
the likelihood is 0. If x is not a leaf, and has children X\, X2, ■ ■ ■ , Xk, then the 
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following recursions hold. 

L x (0) = II E p Xj (m\0)L Xj (m); (10a) 

j=l m=0 

( k 

L x (n) = (1 - D x y n [ J] ]T p Xj (m\n)L Xj (m) 

\j=l m=0 

' E {D x ) n ~\l - D x yL x {i)) ; < n < ]T M„ (10b) 

where Mj is the sum of gene counts at the leaves in the subtree rooted at Xj. 
Ifn> Ej=i Mj, then L x {n) = 0. 

Proof. For a leaf node, or for n > Y,j=i Mj, the theorem is trivial. Otherwise, 
consider the likelihood £ x (n) of the observed gene counts at the leaves in the 
subtree rooted at x, conditioned on the event that there are n genes present 
at x, which may or may not survive. We write the likelihood in two ways. 
First, by conditioning on the number of surviving genes at the children, 

k Mj 

L{n) = II E Pxj{rn\n)L X:i (m). (11) 

j=l m=0 

Secondly, by conditioning on the number of surviving genes at x, 

4(n) = E h (D x ) n ~\l - D x ) l L x {i). (12) 



i=0 



Now, rearranging the equality of the two right-hand sides gives the desired 
result. □ 

Remark. Clearly, the gene counts M x of Theorem|2]are easily computed 
for all x. If m(x) is the gene count for every leaf x then 



m{x) if x is a leaf; 

Y!j=i M Xj if xi, . . . , Xk are the children of x. 



M*={^\. . c ' .,, (13) 
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2.6 Likelihood 



It is assumed that the family size at the root is distributed according to the 
equilibrium probabilities: 

Theorem 3. Let M be the total number of genes at the leaves. The likelihood 
of the observed gene counts equals 

L = 2^ L root (n) ± \ s +T1 — ■ ( 15 ) 

' 1 " jAoot 

Proof. By summing the likelihoods conditioned on the surviving genes at the 
root, 

M 00 fn + i\ 

L = £ L root (n) £ 7 (n + i) + (£>«*)*(!- 

n=0 i=0 \ 1 / 

Now, plugging in the values of 7(-) from Eq. (|14|) and replacing the infinite 
series by a closed form gives the theorem's formula. □ 



3 Algorithm 

This section employs the formulas of Section El in a dynamic programming 
algorithm to compute the likelihood exactly. More precisely, the algorithm 
computes the likelihood of gene counts at the tree leaves, given the dupli- 
cation rate A, the transfer rate k, and the loss rate Algorithm Com- 
puteLikelihood below proceeds by a depth-first traversal; the necessary 
variables are calculated from the leaves towards the root. Let m(u) denote 
the gene count at every leaf u. 
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ComputeLikelihood 

Input A, k, fx, T, gene counts m(u) : u is a leaf of T 
Output likelihood of the m(-) values 

1 for each node x G V(t) in a depth- first traversal 

2 Compute D x using Eq. (jlj). 

3 Compute the sum of gene counts M x by Eq. (|T3|) . 

4 if a; is not the root then 

5 Let y be the parent of x. 

6 for n = 0, . . . , My do 

7 for m = 0, . . . , M x do compute p x (m\n) by Eq. (JJJ). 

8 for n = 0, . . . , M x do compute L x (n) by Eq. (fTUJl. 

9 Compute the likelihood L at the root using Eq. (|15|). 

10 return L. 

Theorem 0] below analyzes the algorithm's complexity in terms of the 
topology of T. In particular, it uses the notions of height of a node x, defined 
as the number of edges on the path leading from the root to x, levels of nodes, 
which are sets of nodes with the same height, and height of the tree, which 
is the maximum of the leaf heights. 

Theorem 4. Let h be the height ofT in Algorithm COMPUTELIKELIHOOD, 

and N the number of its leaves, and let M = M mot be the sum of gene counts. 
The algorithm can be implemented in such a way that it uses 0(N + M 2 ) 
space and runs in 0(N + hM 2 ) time. 

Proof. Computing D x and M x takes 0(1) time when x is a leaf, or 0(k) for 
an inner node with k children. There are O(N) nodes in the tree and, thus, 
computing D x and M x for all x is done in O(N) time. The computed values 
are stored in 0(N) space. 

In order to analyze the computations in Lines EHHl we consider nodes at 
the same level. Line |S1 computes L x (n) for n = 0, . . . , M x in 0(M 2 ) time. 
Lines |3H3 compute p x (m\n) for (M x + l)(M y + l) pairs of n, m values. (Notice 
that H y (m) can be computed in 0(1) time for each m in the iteration over m 
using that H y {m) = a m+K J i X ~ 1 H y (m — 1) with the a of Eq. (jHJ).) For the 
children x\, . . . ,Xk of the same node y, the total time spent in Lines EH3 
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is O(My). Hence, the time spent on computing values for all nodes at the 
same level k is 

o( E M i+ E mA. 

all y at level fc — 1 all a; at level fc 

Clearly, J2 X Mr < M if the summation goes over x for which their subtrees do 
not overlap, such as nodes at the same level. Now, J2 X M 2 < (J2 X M x ) 2 < M 2 , 
and, thus, 0(M 2 ) time is spent on each level. Therefore, the total time spent 
in the loop of LineHis -0(hM 2 ). LineEJtakes 0{M) time. 

In order to obtain the space complexity result, notice that at the end 
of the loop in Line |H1 the computed variables for the children of x are not 
needed anymore. Therefore, the nodes for which p x {-\-) is needed are such 
that their subtrees do not overlap. By the same type of argument as with 
time spent on a level, the number of variables that need to be kept in memory 
is 0(M 2 ). □ 



4 Gene content evolution in Proteobacteria 

Proteobacteria form one of the most diverse groups of Prokaryotes. Pro- 
teobacteria are an excellent model case for studying genome content evo- 
lution: they include pathogens, endosymbionts, and free-living organisms. 
Genome sizes vary tenfold within this group, and horizontal transfer is abun- 
dant flGogarten et al. "20 02). Their phylogeny is still not resolved to satisfac- 
tion ()Lerat et al. 20031 iBoussau et al. 20041 IHer beck et al. 20051 IBelda et al.^20051) . 
We used 51 Proteobacteria in the first application of our likelihood method. 
Gene counts were based on the newer version (jTatusov et al. 2003j) of the 
COG database. Each COG is a manually curated protein family of ho- 
mologs. The COGs are classified into 23 functional categories. For each of 
our 51 Protebacteria, the number of genes in each COG family was estab- 
lished. There are 3555 COG families that have at least one member in the 
organisms. (The organisms are listed in the Appendix.) The data set was 
provided to us by Csaba Pal and Martin Lercher (IPal et al. 2005j) . The pur- 
pose of applying the likelihood method was not to carry out in-depth data 
analysis, but rather to get a first impression of our method's performance on 
realistic data. 

First we optimized the branch lengths and the A, k parameters while 
keeping \i — 1.0 to fix the scaling. In a second pass, we clustered the COG 
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Figure 2: Rates in different groups and the distribution of COG functional 
categories. The functional categories are: J-translation, K-transcription, 
L-replication and repair, D-cell cycle control and mitosis, V — defense mech- 
anisms, T-signal transduction, M-cell wall /membrane/envelope biogenesis, 
N-cell motility, U-intracellular trafficking and secretion, O-posttranslational 
modification, protein turnover and chaperones, C-energy production and 
conversion, G-carbohydrate transport and metabolism, E-amino acid trans- 
port and metabolism, F-nucleotide transport and metabolism, H-coenzyme 
transport and metabolism, I lipid transport and metabolism, P-inorganic ion 
transport and metabolism, Q-secondary metabolites biosynthesis, transport 
and catabolism, R-general function prediction only, S-function unknown. 
The "size" columns gives the number of COGs in each rate group. (The 
numbers in one row do not always add up to the value in the "size" column 
because some COGs have more than one functional assignment.) 
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families with different rates in different groups. The groups were established 
in several iterations of Expectation Maximization: in an E-step, each family 
was assigned to the best group (the one whose rates give the highest likeli- 
hood), in an M-step, rates were optimized within each group separately to 
maximize the likelihood of the COG gene counts within the group's families. 
Figure 121 shows the rates in different groups (Groups 0-8), as well as the 
distribution of COG functional classes across clusters. The picture shows 
that various rate groups are needed to describe the evolution of the families. 
While the results and the methodology still need a thorough critical assess- 
ment, some interesting patterns already emerge. About 19% of the families 
are very stable (Group 8), this includes the large majority of genes involved 
in translation (category J) such as tRNA synthetases and ribosomal proteins, 
and cell cycle control (category D). About one in nine families fall into the 
groups with large horizontal transfer rates (Groups 1 and 7), while one in 
three families are in groups with very low transfer rates. There are many 
categories where duplication plays only a minor role. For instance, the evo- 
lution of cell motility (category N), and various metabolic functions (F,H,I) 
seem to be shaped mainly by horizontal transfer and loss. 



5 Conclusion 

We presented the first three-parameter model of gene content evolution, along 
with a fast algorithm for computing likelihoods. We implemented parameter 
optimization and a gene family clustering method and carried out a pilot 
experiment using COG family sizes in 51 Proteobacteria. 

We modeled gene family evolution by a birth-and-death process. It 
was shown that birth-and-death processes (as opposed to concerted evolu- 
tion) appropriately represent family evolution in at least some gene fami- 
lies (INei et al. 19971 |Michel more and Meyers 19981 IPiontkivska et al. 20 02 ) . 
In addition, birth-and-death processes of various complexity explain the ob- 
served power-law behavior of gene family sizes (|Karev et al. 2002] IKarev et al. 2 003 
IKarev et al. 2004) Reed and Hughes 2004| ). In order to develop a truly real- 



istic likelihood model, rate variation must be permitted across lineages and 
families. Our formulas can be readily adapted to branch-dependent rates. 
The challenge lies rather in the parametrization: introducing four param- 
eters (three rates and branch length) for every tree edge and every family 
will lead to overfitting. A possible solution is to work with parameters that 
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depend only on the gene family and parameters that depend only on the 
branch. These two sets of parameters are combined for each branch and 
family to infer branch-specific rates. We are now working on developing ad- 
equate rate-variation models, in a clustering-based approach as we did in 
Section 0J and by imposing rate distribution functions. In another line of 
extension, we are investigating the coupling of the model with sequence evo- 
lution models to enable a finer modeling of homologies than simple counts. 
By scoring the similarity of genes within the same family, one can arrive to 
a finer likelihood model of gene content evolution. 

It is interesting to point out that while the mathematical model assigns a 
non-zero probability to the case when the gene family has no members at any 
of the leaves, a family with no extant genes is not included usually in the data. 
Consequently, likelihood methods tend to underestimate the extent of gene 
losses. The situation is similar to what is encountered in likelihood models 
of intron evolution and a possible remedy is discussed in (ICsuros 2005 ). 

This paper focuses on the core algorithmic problems of likelihood com- 
putations in a biologically realistic model of gene content evolution. The 
presented likelihood algorithm can be utilized in a number of contexts. The 
computations can be used in parameter optimization for estimating dupli- 
cation, loss, and transfer rates in different gene families. By comparing the 
maximum likelihood values achieved with different evolutionary tree topolo- 
gies, organismal phylogeny can be derived based on gene content. "Unusual" 
branches with excess transfer, loss, etc., can be identified by examining the 
likelihoods, adapting an idea of fHa hn et al. 2005]) . The conditional like- 
lihoods of ^2.51 can be used in likelihood-based computations of ancestral 
gene content, similarly to standard methods employed in case of molecu- 
lar sequences ( |Pupko et al. 200 0). The likelihood computation enables also 
the sampling of different trees in a Bayesian Markov Chain Monte Carlo 
approach. We believe that our approach to computing exact likelihoods 
efficiently in the three-parameter model will find many applications in com- 
parative gene content analysis. 
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Appendix: organisms in the data set 



The picture below shows the organisms and the phylogeny in the experi- 
ments of Section HJ Branch lengths are already optimized to maximize the 
likelihood. Notice that branch lengths are not easy to interpret: scaling is 
defined in such a way that the rate fi = 1 in Group 0, a modestly dynamic 
group (cf. Fig. |2J). 
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Abbreviations: EcolK12 -Escherichia coli K12, Sfle -Shigella flexneri 2a 
str. 2457T, Ecol933 -Escherichia coli 0157:H7 str. EDL933, Ecol06 - 
Escherichia coli 06, Styp -Salmonella typhimurium LT2, Sent -Salmonella 
enterica subsp. enterica serovar Typhi str. CT18, Ypes - Yersinia pestis 
biovar Medievalis str. 91001, Plum -Photorhabdus luminescens subsp. lau- 
mondii TTOl, BaphSg -Buchnera aphidicola str. Sg, BaphAPS -Buchnera 
aphidicola str . APS, BaphBp -Buchnera aphidicola str. Bp, Wglo - Wigglesworthia 
glossinidia endosymbiont of Glossina brevipalpis, Bflo -[Candidatus] Blochman- 
nia floridanus, Pmul -Pasteurella multocida subsp. multocida str. Pm70, 
Hinf -Haemophilus influenzae Rd KW20, Hduc -Haemophilus ducreyi 35000HP, 
Ppro -Photobacterium profundum SS9, VvulCM - Vibrio vulnificus CMCP6, 
VvulYJ -Vibrio vulnificus YJ016, Vpar -Vibrio parahaemolyticus RIMD 
2210633, Vcho -Vibrio cholerae 01 biovar eltor str. N16961, Sone -Shewanella 
oneidensis MR-1, Psyr -Pseudomonas syringae pv. tomato str. DC3000, 
Pput -Pseudomonas putida KT2440, Paer -Pseudomonas aeruginosa PAOl, 
Cbur -Coxiella burnetii RSA 493, Xaxo -Xanthomonas axonopodis pv. citri 
str. 306, Xcam -Xanthomonas campestris pv. campestris str. ATCC 33913, 
Xfas9a -Xylella fastidiosa 9a5c, XfasTem -Xylella fastidiosa Temeculal, Neur 
-Nitrosomonas europaea ATCC 19718, NmenMC -Neisseria meningitidis MC58, 
NmenZ -Neisseria meningitidis Z2491, Cvio -Chromobacterium violaceum ATCC 
12472, Bbro -Bordetella bronchiseptica RB50, Bpar -Bordetella parapertus- 
sis 12822, Rsol -Ralstonia solanacearum GMI1000, Rpro -Rickettsia prowazekii str. 
Madrid E, Rcon -Rickettsia conorii str. Malish 7, WspM - Wolbachia en- 
dosymbiont of Drosophila melanogaster, Smel -Sinorhizobium meliloti 1021, 
Atum - Agrobacterium tumefaciens str. C58, Mlot -Mesorhizobium loti MAFF303099, 
Bsui -Brucella suis 1330, Bmel -Brucella melitensis 16M, Bjap -Bradyrhizobium 
japonicum USDA 110, Rpal -Rhodopseudomonas palustris CGA009, Cere - 
Caulobacter crescentus CB15, Bbac -Bdellovibrio bacteriovorus HD100, Dvul 
-Desulfovibrio vulgaris subsp. vulgaris str. Hildenborough, Gsul -Geobacter 
sulfurreducens PC A. 
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